Foundations of Geospatial Analysis

Professor Adam Dennett

Centre for Advanced Spatial Analysis, University College London

February 21, 2023

About Me

  • Professor of Urban Analytics & Current Head of Department @ Bartlett Centre for Advanced Spatial Analysis (CASA), UCL

  • Geographer by background - ex-Secondary School Teacher - back in HE for 15+ years

  • Taught GIS / Spatial Data Science at postgrad level for last 10 years

About this session

  • Whistle-stop tour of some of the key concepts relating to spatial data

  • An illustrative example analysing some spatial data in London - demonstrating the “spatial is special” idiom and how we might account for spatial factors in our analysis

  • All slides and examples are produced in RMarkdown using Quarto and R so everything can be forked and reproduced in your own time later - just go to the Github Repo link below

  • By the end I hope you’ll all leave with a better introductory understanding of how we should pay attention to the influence of space in any analysis

Key Geospatial Concepts

  • Where? (absolute)
  • Where? (relative)
  • How near or distant?
  • What scale?
  • What shape?

Where? (absolute)

  • Everything happens somewhere

    • We’re here: Wallspace, 22 Duke’s Road, Camden, London, England, Europe, Northern Hemisphere, Earth

Where? (absolute)

  • How do we know exactly where?

XKCD - No, The Other One

https://xkcd.com/2480/

Where? Coordinate Reference Systems

  • Coordinates are more reliable than names, which are rarely unique and often refer to fuzzy locations

  • The earth is roughly spherical and points anywhere on its surface can be described using the World Geodetic System (WGS) - a geographic (spherical) coordinate system

  • Points can be referenced according to their position on a grid of latitudes (degrees north or south of the equator) and longitudes (degrees east or west of the Prime - Greenwich - meridian)

  • The last major revision of the World Geodetic System was in 1984 and WGS84 is still used today as the standard system for referencing places on the globe.

https://www.earthdatascience.org/courses/use-data-open-source-python/intro-vector-data-python/spatial-data-vector-shapefiles/geographic-vs-projected-coordinate-reference-systems-python/
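
Because latitude and longitude are angles rather than distances, the separation between two WGS84 points has to be computed on the sphere. A minimal base R sketch of the haversine formula follows; it assumes a perfectly spherical earth with a radius of 6371 km, and the function name and the two city coordinates are illustrative values rather than anything from the slides.

```r
# Great-circle (haversine) distance between two WGS84 points, in kilometres.
# Assumes a spherical earth of radius 6371 km, so results are approximate.
haversine_km <- function(lon1, lat1, lon2, lat2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(sqrt(pmin(1, a)))
}

# Illustrative, approximate coordinates for central London and central Paris
london <- c(lon = -0.1276, lat = 51.5072)
paris  <- c(lon = 2.3522,  lat = 48.8566)

d_km <- haversine_km(london[["lon"]], london[["lat"]],
                     paris[["lon"]], paris[["lat"]])

# One degree of latitude along a meridian, for comparison
one_degree_km <- haversine_km(0, 0, 0, 1)
```
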

Where? Coordinate Reference Systems

  • Projected Coordinate Reference Systems convert the 3D globe to a 2D plane and can do so in a huge variety of different ways

  • Most national mapping agencies have their own projected coordinate systems - in Britain the Ordnance Survey maintain the British National Grid which locates places according to 6-digit Easting and Northing coordinates

  • Every coordinate system can be referenced by its EPSG code, e.g. WGS84 = 4326 or British National Grid = 27700 with mathematical transformations to convert between them
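
As a sketch of such a transformation, the sf package (assumed to be installed) can re-project a point from WGS84 to British National Grid using the two EPSG codes above. The longitude and latitude below are approximate, illustrative values for the Euston area of London and are not taken from the slides.

```r
library(sf)  # assumed to be installed

# Approximate WGS84 longitude / latitude near Euston, London (illustrative)
pt_wgs84 <- st_sfc(st_point(c(-0.129, 51.527)), crs = 4326)

# Re-project to British National Grid (EPSG:27700)
pt_bng <- st_transform(pt_wgs84, 27700)

# Easting (X) and Northing (Y) in metres
xy <- st_coordinates(pt_bng)
xy
```

The exact output depends on the transformation grids available to the underlying PROJ library, so differences of a few metres between machines would not be surprising.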

Where? Describing and Locating Things with Coordinates

  • Once we have a coordinate reference system we can locate objects accurately in space

  • Most objects that spatial data scientists are concerned with (apart from gridded representations, which we will ignore for now!) can be simplified to either a point, a line or a polygon in that space

  • Polygons and lines are just multiple point coordinates joined together!
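
To make this concrete, a line and a polygon can be stored as nothing more than matrices of point coordinates. The sketch below uses invented coordinates on a projected grid (so distances are in the grid's units) and two hypothetical helper functions; packages such as sf wrap the same idea in richer classes.

```r
# A point is one coordinate pair; lines and polygons are ordered sets of points.
# All coordinates are hypothetical and assumed to be on a projected grid.
point   <- c(x = 2, y = 3)
line    <- rbind(c(0, 0), c(3, 4), c(6, 4))
polygon <- rbind(c(0, 0), c(4, 0), c(4, 3), c(0, 3), c(0, 0))  # closed ring

# Length of a line: sum of the distances between consecutive vertices
line_length <- function(xy) sum(sqrt(rowSums(diff(xy)^2)))

# Area of a simple (non self-intersecting) polygon via the shoelace formula
polygon_area <- function(xy) {
  n <- nrow(xy)
  x <- xy[, 1]
  y <- xy[, 2]
  abs(sum(x[-n] * y[-1] - x[-1] * y[-n])) / 2
}

line_length(line)      # 5 + 3 = 8
polygon_area(polygon)  # 4 * 3 = 12
```
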

Where? Relative - Tobler’s First Law of Geography

“Everything is related to everything else, but near things are more related than distant things.”

  • This observation underpins much of what spatial data scientists do

  • Being able to locate something in space, relative to something else, allows us to:

    • explain why something may be occurring where it is

    • make better predictions about nearby or further away things

  • Underpins the whole Geodemographics (customer segmentation) industry!!

Where? Relative - John Snow’s Cholera Map

Where? Relative - Defining ‘near’ and ‘distant’

  • Near and distant can mean different things in different contexts

    • the furthest one would travel to buy a pint of milk is somewhat different to the furthest one might be willing to commute for a job
  • In spatial data science one way of separating near from distant can simply be to define their topological relationship - Dimensionally Extended 9-Intersection Model (DE-9IM) is the standard topological model used in GIS

  • Touching or overlapping objects = ‘near’

Where? Relative - Exploring Near and Distant

  • Near and distant in London
  • Map shows 2011 Census Wards in London, within Borough Boundaries
  • The Greater London Authority produced the London Ward Atlas - https://data.london.gov.uk/dataset/ward-profiles-and-atlas - which collates a range of demographic and economic indicators for each of these zones in the city

Where? Relative - Exploring Near and Distant

  • If we measure the distance from the centre (centroid) of one ward to another, then we might decide that the 1st, 2nd, 3rd, ... kth closest wards are near and the others are far

  • These neighbour relationships can be stored in an \(n \times n\) ‘spatial weights’ matrix

  • The spdep package in R
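
A base R sketch of the idea, using six invented centroids rather than the London wards: each row of the matrix marks the k closest zones to that zone. In spdep the equivalent steps would normally be handled by `knearneigh()`, `knn2nb()` and `nb2listw()`; the hand-rolled version here is only meant to show what the matrix contains.

```r
# Hypothetical centroids for six zones (projected coordinates, in km)
centroids <- cbind(x = c(0, 1, 2, 0, 5, 6),
                   y = c(0, 0, 1, 2, 5, 5))
n <- nrow(centroids)
k <- 2

# Pairwise Euclidean distances between centroids
d <- as.matrix(dist(centroids))
diag(d) <- Inf  # a zone is never its own neighbour

# n x n binary weights matrix: W[i, j] = 1 if j is one of i's k nearest zones
W <- matrix(0, n, n)
for (i in seq_len(n)) {
  W[i, order(d[i, ])[seq_len(k)]] <- 1
}

# Row-standardise so that each row sums to 1
W_row <- W / rowSums(W)
```

Note that k-nearest-neighbour relationships need not be symmetric: here zone 3 is one of zone 5's two nearest neighbours, but zone 5 is not one of zone 3's.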

Where? Relative - Exploring Near and Distant

  • We can then decide to include the “k” nearest neighbours and exclude the rest

Where? Relative - Exploring Near and Distant

  • Other conceptions of near might include any contiguous ward with distant simply being those which are not contiguous

  • Near or distant could also be defined by some distance threshold
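
A distance-threshold definition can be sketched in a few lines of base R. The centroids and the 2.5 km cut-off are invented for illustration; in spdep, `dnearneigh()` builds distance-band neighbours and `poly2nb()` builds contiguity neighbours from polygons.

```r
# Hypothetical centroids for six zones (projected coordinates, in km)
centroids <- cbind(x = c(0, 1, 2, 0, 5, 6),
                   y = c(0, 0, 1, 2, 5, 5))

d <- as.matrix(dist(centroids))
threshold_km <- 2.5  # assumed cut-off between 'near' and 'distant'

# 1 where two different zones are within the threshold, 0 otherwise
W_dist <- (d > 0 & d <= threshold_km) * 1

# Number of 'near' zones for each zone
n_near <- rowSums(W_dist)
```

Unlike the k-nearest approach, this matrix is symmetric, but the number of neighbours varies from zone to zone and isolated zones can end up with very few.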

Analysis of ‘where’?

  • Where in London do students perform best and worst in their GCSE exams?

Is there any pattern? Do better scores and worse scores appear to be clustered? How can we tell?

Spatial Autocorrelation

  • Spatial Autocorrelation - phenomenon of near things being more similar than distant things.

    • Do neighbouring wards have more similar GCSE points scores than distant wards?
  • Can test for spatial autocorrelation by comparing the GCSE Scores in any given ward with the GCSE scores in neighbouring wards (however we choose to define our neighbours - k-nearest, those that are contiguous etc.)

  • Average value of GCSE scores in the neighbouring wards is known as the spatial lag of GCSE scores

Spatial Autocorrelation

                          (Intercept) average_gcse_capped_point_scores_2014 
                          190.2624075                             0.4190508 
  • If there is a linear correlation between the variable and its spatial lag, we can observe that values in near places do tend to cluster
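
The spatial lag and its relationship with the original variable can be sketched in base R. The six scores and the chain-shaped neighbour structure below are invented for illustration, and the weights are row-standardised so that the lag is a simple average of the neighbouring values.

```r
# Toy data: six zones in a row, each neighbouring the zones either side of it
x <- c(10, 12, 11, 20, 22, 21)   # hypothetical scores
n <- length(x)

W <- matrix(0, n, n)
for (i in seq_len(n - 1)) {
  W[i, i + 1] <- 1
  W[i + 1, i] <- 1
}
W <- W / rowSums(W)              # row-standardise: each row sums to 1

# Spatial lag: the average of the values in each zone's neighbours
x_lag <- as.vector(W %*% x)

# Regress the lag on the variable; a positive slope suggests clustering
lag_slope <- unname(coef(lm(x_lag ~ x))[2])
```

In this toy example low scores sit next to low scores and high next to high, so the slope comes out clearly positive.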

Moran’s I

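  • With row-standardised weights, Moran’s I is equal to the least-squares slope from regressing the spatial lag of a variable on the variable itself
  • Values typically range from +1 (perfect positive spatial autocorrelation, i.e. clustering) to -1 (perfect dispersal), with values close to 0 indicating no spatial pattern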
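
For reference, the statistic is usually written as follows, where \(x_i\) is the value in zone \(i\), \(\bar{x}\) is the mean across zones, \(w_{ij}\) is the spatial weight between zones \(i\) and \(j\), and \(n\) is the number of zones (this notation is introduced here and does not appear on the slides):

\[
I = \frac{n}{\sum_{i}\sum_{j} w_{ij}} \cdot \frac{\sum_{i}\sum_{j} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}{\sum_{i}(x_i - \bar{x})^2}
\]

When every row of the weights matrix sums to 1, \(\sum_{i}\sum_{j} w_{ij} = n\) and the first fraction drops out. Under the null hypothesis of no spatial autocorrelation the expected value is \(E[I] = -1/(n-1)\), which is slightly below zero.
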
moran.test(LondonWardsMerged$average_gcse_capped_point_scores_2014, nb2listw(LWard_nb))

    Moran I test under randomisation

data:  LondonWardsMerged$average_gcse_capped_point_scores_2014  
weights: nb2listw(LWard_nb)    

Moran I statistic standard deviate = 17.785, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
     0.4190507533     -0.0016025641      0.0005594495 
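
The figures in this output can be checked by hand. The short sketch below copies the three sample estimates from the test output and takes the number of wards (625) from the "Number of observations" line in the spatial lag model output later in the deck; it assumes the test statistic is the usual standardised difference between the observed and expected value.

```r
# Values copied from the moran.test() output
I_obs <- 0.4190507533
E_I   <- -0.0016025641
Var_I <- 0.0005594495
n     <- 625  # number of wards, from the lag model output

# The expectation under no spatial autocorrelation is -1 / (n - 1)
expected_from_n <- -1 / (n - 1)

# Standard deviate: how many standard deviations I sits above its expectation
z <- (I_obs - E_I) / sqrt(Var_I)

# One-sided p-value for the alternative 'greater'
p_value <- pnorm(z, lower.tail = FALSE)

round(z, 3)  # 17.785, matching the reported standard deviate
```
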

Moran’s I

  • Moran’s I = 0.42

  • Moderate, positive spatial autocorrelation between average GCSE scores in London - some clustering of both low and high scores

  • Spatial autocorrelation might be expected once the distribution of schools is overlaid and one realises that pupils from multiple neighbouring wards might attend the same school

Explaining Spatial Patterns

  • Having observed some spatial patterns in school exam performance in London, we might next want to explain these patterns, perhaps using another variable measured for the same spatial units.

  • Our own experience might tell us that missing class could negatively impact our ability to learn things in that class

  • Hypothesis: wards where there are higher rates of absence from school might tend to experience lower average exam grades

Explaining Spatial Patterns

coef(lm(average_gcse_capped_point_scores_2014 ~ unauthorised_absence_in_all_schools_percent_2013, data = LondonWardsMerged))
                                     (Intercept) 
                                       371.71500 
unauthorised_absence_in_all_schools_percent_2013 
                                       -41.40264 
p2 <- ggplot(LondonWardsMerged, aes(x = unauthorised_absence_in_all_schools_percent_2013, y = average_gcse_capped_point_scores_2014))
p2 + geom_point() + geom_smooth(method = "lm", se = FALSE) + xlab("% Unauthorised Absence Days 2013") + ylab("Avg GCSE Score 2014")
  • Taking the whole of London, it would appear that there is a moderately strong, negative relationship between missing school and exam performance

  • For every additional percentage point of school days missed, we might expect a decrease of around 41 points in GCSE score.

  • But does this relationship hold true across all wards in the city?
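
To see what the fitted line implies, the two coefficients reported above can be plugged into a small prediction function. The function name and the three absence rates are illustrative choices, and the predictions are only as good as the assumption that one straight line fits every ward.

```r
# Coefficients copied from the coef(lm(...)) output above
intercept <- 371.71500
slope     <- -41.40264

# Predicted average GCSE score for a given % of unauthorised absence
predict_gcse <- function(absence_pct) intercept + slope * absence_pct

absence_pct <- c(0.5, 1, 1.5)   # illustrative absence rates (%)
predicted   <- predict_gcse(absence_pct)

data.frame(absence_pct, predicted)
```

Each extra percentage point of absence moves the prediction down by the same 41.4 points, wherever the ward is in London.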

Explaining Spatial Patterns

  • The Moran’s I of GCSE scores tells us that the observations are probably not independent of each other (violating one assumption of regression)

  • Mapping the residual values from the regression model allows us to observe any spatial clustering in the errors

  • Clustering of residuals could also indicate a violation of the independence assumption of errors


    Moran I test under randomisation

data:  LondonWardsMerged$model1_resids  
weights: nb2listw(LWard_nb)    

Moran I statistic standard deviate = 12.183, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
     0.2862894906     -0.0016025641      0.0005583971 
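
Why residuals end up clustered can be illustrated with simulated data. In the sketch below everything is invented: 50 zones in a row, an absence variable, and a spatially smooth influence on scores that the regression leaves out. The `morans_i()` helper is a hand-written version of the statistic for a weights matrix, not the spdep function.

```r
set.seed(1)
n <- 50

# Chain of zones: each zone neighbours the zones either side of it
W <- matrix(0, n, n)
for (i in seq_len(n - 1)) {
  W[i, i + 1] <- 1
  W[i + 1, i] <- 1
}
W <- W / rowSums(W)  # row-standardise

# Moran's I computed directly from its definition
morans_i <- function(v, W) {
  z <- v - mean(v)
  (length(v) / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
}

absence       <- runif(n, 0.5, 2.5)
smooth_effect <- 10 * sin(2 * pi * seq_len(n) / n)  # omitted from the model
score <- 370 - 40 * absence + smooth_effect + rnorm(n, sd = 2)

# The model ignores the smooth effect, so it ends up in the residuals
model   <- lm(score ~ absence)
resid_i <- morans_i(residuals(model), W)
resid_i
```

The omitted smooth effect is pushed into the error term, so neighbouring residuals resemble each other and Moran's I of the residuals is strongly positive.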

Dealing with Spatial Patterns - Spatial Regression Models (the spatial lag model)


Call:
lagsarlm(formula = average_gcse_capped_point_scores_2014 ~ unauthorised_absence_in_all_schools_percent_2013, 
    data = LondonWardsMerged, listw = nb2listw(LWard_nb, style = "W"), 
    method = "eigen")

Residuals:
      Min        1Q    Median        3Q       Max 
-68.70402  -9.44615  -0.64207   8.53417  58.56788 

Type: lag 
Coefficients: (asymptotic standard errors) 
                                                 Estimate Std. Error z value
(Intercept)                                      207.4009    15.0053  13.822
unauthorised_absence_in_all_schools_percent_2013 -30.7843     2.0792 -14.806
                                                  Pr(>|z|)
(Intercept)                                      < 2.2e-16
unauthorised_absence_in_all_schools_percent_2013 < 2.2e-16

Rho: 0.46705, LR test value: 104.93, p-value: < 2.22e-16
Asymptotic standard error: 0.041738
    z-value: 11.19, p-value: < 2.22e-16
Wald statistic: 125.22, p-value: < 2.22e-16

Log likelihood: -2581.93 for lag model
ML residual variance (sigma squared): 217.21, (sigma: 14.738)
Number of observations: 625 
Number of parameters estimated: 4 
AIC: 5171.9, (AIC for lm: 5274.8)
LM test for residual autocorrelation
test value: 3.0949, p-value: 0.078537
  • One way of coping with spatial dependence in the dependent variable is to include the spatial lag of that variable as an independent explanatory variable

  • The spatialreg package in R allows us to easily include a spatial lag of the dependent variable as an extra explanatory variable in a standard linear regression model; its coefficient is \(\rho\) (Rho)

  • Running the spatial lag model reveals that the spatial lag is statistically significant and has the effect of reducing the estimated impact of missing 1% of school days from -41 points to -31 points.
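
In matrix form, the spatial lag model is usually written as below, where \(y\) is the vector of GCSE scores, \(X\) holds the intercept and absence variable, \(W\) is the row-standardised weights matrix and \(\varepsilon\) is the error term (notation introduced here rather than taken from the slides):

\[
y = \rho W y + X\beta + \varepsilon
\]

Because \(y\) appears on both sides, the model can be rearranged, with \(I_n\) the \(n \times n\) identity matrix, as:

\[
y = (I_n - \rho W)^{-1}(X\beta + \varepsilon)
\]

The rearranged form shows that a change in absence in one ward also feeds through to scores in neighbouring wards, so \(\beta\) in this model is probably best read as a starting point rather than a like-for-like replacement for the OLS slope. The spatialreg package provides an `impacts()` function for summarising the direct and indirect effects.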

Dealing with Spatial Patterns - Spatial Non-Stationarity

  • One reason behind a clustering of residuals could be that the relationship between dependent and independent variables might not remain constant across space

    • In some parts of London, it could be that as unauthorised absence from school rises, exam grades also rise (as unlikely as that might be!).

    • Or, more plausibly, that in some parts of the city, absence has an even more pronounced negative effect than in others.

    • It’s also likely that the intercept values (the average value of GCSE scores, given no days of unauthorised absence) will be different in different parts of the city - some areas, on average, doing better than others

  • We can test for the presence of such phenomena by running a series of smaller, more localised regressions and comparing the coefficients that emerge

Geographically Weighted Regression

  • GWR is a method for systematically running a series of localised regression analyses across a study area, collecting coefficients and other diagnostics for an independent variable in each zone of interest.

  • Something similar can be achieved through spatial sub-setting - i.e. running analyses for groups of zones within a higher level geography

Geographically Weighted Regression

  • In a GWR analysis, kernel weighting functions of different bandwidths (diameters) and shapes are used to include and weight or exclude neighbouring observations

  • Adaptive weighting can be used to adjust the size of the kernel according to some threshold of observations

  • For every point in the dataset a regression is run including the values within the kernel (which, of course, can only be achieved effectively through understanding the coordinate reference system of the observations)
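
The mechanics can be sketched in base R with simulated data. Everything below is an assumption made for illustration: the point locations, a true slope that weakens from west to east, a fixed Gaussian kernel and a bandwidth of 2 coordinate units. Dedicated packages such as GWmodel or spgwr also handle bandwidth selection and diagnostics, which this sketch omits.

```r
set.seed(42)
n <- 100

# Hypothetical point locations on a 10 x 10 projected grid
coords  <- cbind(x = runif(n, 0, 10), y = runif(n, 0, 10))
absence <- runif(n, 0.5, 2.5)

# Simulated non-stationarity: the true slope runs from -50 (west) to -10 (east)
true_slope <- -50 + 4 * coords[, "x"]
score <- 370 + true_slope * absence + rnorm(n, sd = 5)

bandwidth <- 2                   # fixed kernel bandwidth (assumed)
d <- as.matrix(dist(coords))     # distances between all pairs of points

# One weighted regression per point, keeping the local slope each time
local_slope <- vapply(seq_len(n), function(i) {
  w <- exp(-0.5 * (d[i, ] / bandwidth)^2)   # Gaussian kernel weights
  fit <- lm(score ~ absence, weights = w)
  unname(coef(fit)[2])
}, numeric(1))

# Do the local slopes recover the west-to-east pattern?
slope_vs_easting <- cor(local_slope, coords[, "x"])
slope_vs_easting
```

Mapping `local_slope` at each location is the toy equivalent of plotting the GWR coefficients for each ward.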

Geographically Weighted Regression

  • Plotting coefficient values for each ward reveals noticeable non-stationarity in the relationship between absence and GCSE scores

  • In well-off central London boroughs (particularly Hammersmith and Fulham, Kensington and Chelsea and Camden) we see evidence that absence is positively related to GCSE performance

  • In some of the outer-London boroughs (Barnet, Sutton, Richmond etc.) the effect of missing school is even more severe than it is elsewhere in the city

Modifiable Areal Units and Ecological Fallacies

  • Methods which accommodate space explicitly can help us better understand spatial phenomena, but the arrangement of space can alter perceptions and the outcomes of analyses

  • The Modifiable Areal Unit Problem (MAUP) - popularised in the 1980s by Stan Openshaw - describes issues that relate to the shape, scale and aggregation of underlying phenomena into artificial spatial units

  • Politicians have known about the issues of scale and aggregation for a long time and have used it to their advantage

  • The practice of Gerrymandering is widespread wherever there is a first-past-the-post electoral system and has been used to redraw electoral boundaries in order to influence election outcomes
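
A toy example shows how much the zoning alone can matter. The nine voters below sit on an invented 3 x 3 grid (numbered row by row) and both sets of districts are hypothetical but contiguous; the votes never change, only the boundaries drawn around them.

```r
# Votes on a 3 x 3 grid, numbered row by row:
#   A A B
#   A A B
#   B B B
votes <- c("A", "A", "B",
           "A", "A", "B",
           "B", "B", "B")

# Seats won by each party under first-past-the-post within each district
seats <- function(votes, district) {
  winners <- tapply(votes, district, function(v) names(which.max(table(v))))
  table(factor(winners, levels = c("A", "B")))
}

# Zoning 1: one district per row of the grid
zoning_rows <- c(1, 1, 1,
                 2, 2, 2,
                 3, 3, 3)

# Zoning 2: party A's voters mostly packed into a single district
zoning_alt <- c(1, 1, 2,
                1, 2, 2,
                3, 3, 3)

seats(votes, zoning_rows)  # A wins 2 seats, B wins 1
seats(votes, zoning_alt)   # A wins 1 seat, B wins 2
```

Party A has 4 of the 9 votes in both cases, yet wins a majority of seats under the first zoning and a minority under the second.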

Modifiable Areal Units and Ecological Fallacies

  • Related to the MAUP, the Ecological Fallacy describes a confusion between patterns revealed at one level of aggregation and the assumption that they apply either to individuals or lower levels of aggregation

  • The basic, mistaken idea is that because patterns of educational attainment are revealed at Borough level, they must also hold at neighbourhood (or individual) level

  • “Simpson’s Paradox”
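
The reversal at the heart of Simpson's Paradox can be reproduced with a dozen invented observations. In the sketch below, three hypothetical areas each contain four individuals; within every area the relationship between x and y is negative, yet the area averages show a positive relationship.

```r
# Hypothetical data: 3 areas with 4 individuals each
df <- data.frame(
  area = rep(c("A", "B", "C"), each = 4),
  x = c(1, 2, 3, 4,   4, 5, 6, 7,   7, 8, 9, 10),
  y = c(5, 4, 3, 2,   9, 8, 7, 6,   13, 12, 11, 10)
)

# Slope of y on x within each area (individual level)
within_slopes <- sapply(split(df, df$area), function(d) {
  coef(lm(y ~ x, data = d))[["x"]]
})

# Slope of y on x using only the area averages (aggregate level)
area_means    <- aggregate(cbind(x, y) ~ area, data = df, FUN = mean)
between_slope <- coef(lm(y ~ x, data = area_means))[["x"]]

within_slopes  # -1 in every area
between_slope  # 4/3 across the area means
```

Reading the aggregate slope as if it described individuals would get the direction of the relationship wrong, which is exactly the ecological fallacy.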

Conclusions